Spam Detection

Note: This document contains AI-generated content; code and editing refinement were provided by GPT-3-class models.


Welcome to the Modeling Deep Dive Section of MLPROD 310.

This document is an overview of an example spam detection model, and shows how one might build a pipeline for training and evaluating a simple spam detector from basic modeling components.

The notebook is broken down into several steps, each of which outlines some of the options and caveats inherent to that step. The steps build on simple linear and neural network architectures, but this doesn't imply that the chosen architectures are the "best" overall.

Introduction

Spam detection is the process of identifying and filtering unwanted or malicious messages, such as emails, SMS, or social media posts. Spam can range from harmless but annoying advertisements to dangerous phishing attempts and malware distribution. Given the vast volume of digital communication, automated spam detection systems are essential for maintaining security, efficiency, and user experience. This article walks through the construction of a spam classifier using open source models and tooling. The article itself is written as a Jupyter notebook and rendered to HTML. The original Jupyter notebook is available here.

Building an optimal spam detection model involves multiple considerations, including data collection, feature selection, model choice, and evaluation. The decision-making process typically follows these key steps:

  1. Defining the Problem Scope
  • What type of spam needs to be detected (emails, SMS, social media posts, etc.)?

  • What is the acceptable trade-off between false positives (legitimate messages flagged as spam) and false negatives (spam messages that get through)?

  2. Data Collection and Preprocessing
  • Gathering labeled datasets of spam and non-spam messages.

  • Cleaning the data by removing noise, tokenizing text, and handling missing values.

  • Augmenting data with additional signals like sender reputation, message structure, and frequency patterns.

  3. Feature Engineering
  • Extracting relevant features such as word frequency, n-grams, TF-IDF scores, or embeddings from NLP models.

  • Incorporating metadata features (e.g., sender history, link presence, HTML content).

  4. Model Selection
  • Choosing between rule-based systems, classical machine learning models (Naïve Bayes, SVMs, Random Forests), or deep learning approaches (LSTMs, Transformers).

  • Evaluating trade-offs between interpretability, computational cost, and effectiveness.

  5. Training
  • Splitting data into training, validation, and test sets.

  6. Evaluation
  • Using metrics like precision, recall, F1-score, and ROC-AUC to assess performance.

  • Implementing techniques like cross-validation and hyperparameter tuning to optimize the model (a small end-to-end sketch follows this list).
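
To make the steps above concrete, here is a minimal, self-contained sketch of steps 2-6 using a classical baseline. It assumes scikit-learn is available (it is not in the dependency list in Step 2), and the toy corpus and labels are hypothetical, included only to illustrate the API shape:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Hypothetical toy corpus, for illustration only
messages = [
    "Win a free prize now!!!", "Meeting at 3pm tomorrow",
    "URGENT: claim your reward", "Can you send the report?",
    "Free entry to win cash", "Lunch later?",
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = ham

# TF-IDF features feeding a Naive Bayes classifier, scored with 3-fold cross-validation
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
print(cross_val_score(pipeline, messages, labels, cv=3, scoring="f1"))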

Step 1: Define Scope

For the purposes of this exercise we'll look at a corpus of short messages, each labeled as known spam or non-spam ("ham").

There are a variety of techniques that are useful for detecting spam, including:

  • TF/IDF
    • Collecting and comparing word and document frequencies of spam/non-spam sources and creating message vectors.
    • Pros: Conceptually simple components, requiring only a basic understanding of information entropy.
    • Cons: Doesn't handle typos or unknown words well; normalized vectors can be enormous.
  • Word Embedding (Concatenated)
    • Creating/using word embeddings such as Word2Vec and concatenating them into a message embedding.
    • Pros: 10-100x faster than sentence embeddings. Effective for simple descriptive phrases.
    • Cons: Concatenated word embeddings can lose a lot of precision with even a handful of concatenated word vectors (see the pooling sketch after this list).
  • Sentence Embedding
    • Creating/using sentence embeddings such as SentenceBERT as message embeddings.
    • Pros: 10-100x faster than instruction embeddings. Effective for capturing complex semantic sentence structure.
    • Cons: Not nearly as fast as word embeddings, nor capable of handling specific instructions.
  • Transformer Embedding (GPT >=3)
    • Creating/using instruction embeddings from larger models.
    • Pros: Possible to "program" embeddings using special instructions (i.e., give instructions that produce clearer separation between records based on context).
    • Cons: Much more expensive and time-consuming to execute.
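
Mean pooling is a common alternative to the concatenation described above: concatenation preserves word order but grows the vector with message length, while pooling yields a fixed-size vector at the cost of order information. Here is a toy sketch; the 4-dimensional "pretrained" vectors are hypothetical (real Word2Vec/GloVe vectors have 50-300 dimensions and come from a trained model):

import numpy as np

# Hypothetical word vectors standing in for a pretrained embedding table
word_vectors = {
    "free": np.array([0.9, 0.1, 0.3, 0.0]),
    "win":  np.array([0.8, 0.2, 0.4, 0.1]),
    "cash": np.array([0.7, 0.0, 0.5, 0.2]),
}

def message_vector(tokens, vectors, dim=4):
    # Mean-pool the known token vectors; unknown words are simply skipped
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

print(message_vector(["free", "cash", "xyzzy"], word_vectors))  # "xyzzy" is ignored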

Evolution of Text Representation Models: Size and Capabilities

TF-IDF
  • Typical size: Very small (KB to MB)
  • Dimensionality: Sparse vectors (dictionary size)
  • Context window: Document/corpus level
  • Training data requirements: Minimal (just the target corpus)
  • Key capabilities: Simple statistical word importance; effective for document classification; computationally efficient; no pre-training required
  • Limitations: No semantic understanding; no word relationships; sparse representation; fixed vocabulary

Word Vectors (word2vec, GloVe)
  • Typical size: Small (100MB-1GB)
  • Dimensionality: Dense vectors (50-300)
  • Context window: Word level
  • Training data requirements: Medium (1B+ tokens)
  • Key capabilities: Captures semantic relationships; word analogies (king - man + woman = queen); transfer learning for downstream tasks; efficient inference
  • Limitations: Static word representations; no context sensitivity; no sentence-level understanding; word ambiguity issues

Sentence Embeddings (USE, InferSent, SBERT)
  • Typical size: Medium (1-5GB)
  • Dimensionality: Dense vectors (512-1024)
  • Context window: Sentence level
  • Training data requirements: Large (10B+ tokens)
  • Key capabilities: Sentence-level semantics; better for similarity tasks; cross-lingual capabilities; effective for retrieval
  • Limitations: Limited contextual understanding; fixed-length representations; less effective for long documents; limited compositional abilities

Small Transformers (BERT-base, RoBERTa-base)
  • Typical size: Medium (0.5-1GB)
  • Dimensionality: Contextual vectors (768-1024)
  • Context window: Limited (512 tokens)
  • Training data requirements: Very large (30B+ tokens)
  • Key capabilities: Contextual word representations; bidirectional context; strong performance on many NLP tasks; fine-tuning capabilities
  • Limitations: Limited context window; moderate parameter efficiency; training compute requirements; still primarily linguistic understanding

Large Transformers (GPT-3, PaLM, Claude)
  • Typical size: Very large (100GB-1TB+)
  • Dimensionality: Contextual vectors (2048-12288+)
  • Context window: Large (8K-100K+ tokens)
  • Training data requirements: Massive (1T+ tokens)
  • Key capabilities: Few/zero-shot learning; long-range dependencies; emergent abilities; cross-task generalization; natural language generation
  • Limitations: Enormous compute requirements; training cost; potential for biased outputs; "black box" behavior; challenging to interpret

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

# Create DataFrame with model data
data = {
    'name': ['TF-IDF', 'Word2Vec', 'GloVe', 'BERT-base', 'RoBERTa', 'GPT-2', 
             'T5-large', 'GPT-3', 'PaLM', 'GPT-4'],
    'year': [2000, 2013, 2014, 2018, 2019, 2019, 2020, 2020, 2022, 2023],
    'size': [0.001, 0.3, 0.5, 0.4, 0.5, 1.5, 3, 175, 540, 1500],  # Size in GB
    'capability': [
        'Basic word importance/Document classification',
        'Word relationships/Word analogies',
        'Global corpus statistics/Improved semantic capture',
        'Contextual representation/Bidirectional understanding',
        'Optimized pre-training/State-of-art on benchmarks',
        'Better text generation/Zero-shot learning',
        'Text-to-text framework/Multi-task learning',
        'Few-shot learning/Complex instructions',
        'Chain-of-thought reasoning/Advanced problem solving',
        'Nuanced reasoning/Multimodal understanding'
    ]
}

df = pd.DataFrame(data)

# Sort by year for proper timeline
df = df.sort_values(by=['year', 'size'])

# Create figure and axis
plt.figure(figsize=(12, 8))
ax = plt.subplot(111)

# Plot with log scale for y-axis to handle the dramatic size differences
ax.semilogy(df['year'], df['size'], marker='o', markersize=10, 
            linewidth=2, color='#2563eb')

# Format y-axis to show values nicely
def size_formatter(x, pos):
    if x < 1:
        return f"{x:.3f}"
    # Drop the decimal for whole-number sizes, otherwise show one decimal place
    return f"{int(x)}" if x == int(x) else f"{x:.1f}"

ax.yaxis.set_major_formatter(FuncFormatter(size_formatter))

# Add annotations for each model
for i, row in df.iterrows():
    # Determine annotation placement (above or below point based on position)
    if row['size'] > 10:
        y_offset = -1.2  # Place below for large models
        va = 'top'
    else:
        y_offset = 1.2  # Place above for small models
        va = 'bottom'
    
    # Add model name
    ax.annotate(
        f"{row['name']}",
        xy=(row['year'], row['size']),
        xytext=(0, 20 * y_offset),
        textcoords="offset points",
        ha='center',
        va=va,
        fontweight='bold',
        fontsize=9,
        bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="gray", alpha=0.9)
    )
    
    # Add capability text in smaller font
    ax.annotate(
        f"{row['capability']}",
        xy=(row['year'], row['size']),
        xytext=(0, 45 * y_offset),
        textcoords="offset points",
        ha='center',
        va=va,
        fontsize=8,
        bbox=dict(boxstyle="round,pad=0.3", fc="#f0f7ff", ec="#c7dbff", alpha=0.9),
        wrap=True
    )

# Add labels and title
plt.xlabel('Year', fontsize=12)
plt.ylabel('Model Size (GB)', fontsize=12)
plt.title('Growth in NLP Model Size (2000-2023)', fontsize=14, fontweight='bold')

# Add grid for better readability (especially with log scale)
plt.grid(True, which="both", ls="-", alpha=0.2)

# Adjust the x-axis to give some padding
x_min, x_max = df['year'].min() - 1, df['year'].max() + 1
plt.xlim(x_min, x_max)

# Add a note about log scale
plt.figtext(0.5, 0.01, 
           "Note: Y-axis uses logarithmic scale to visualize the exponential growth in model size", 
           ha="center", fontsize=9, style='italic')

# Layout adjustment to make space for annotations
plt.tight_layout(rect=[0, 0.03, 1, 0.95])

# Save the figure
plt.savefig('nlp_model_size_growth.png', dpi=300, bbox_inches='tight')

# Show the plot
plt.show()

# Display the data as a table
print("\nNLP Model Size and Capability Data:")
print(df[['name', 'year', 'size', 'capability']].to_string(index=False))


NLP Model Size and Capability Data:
     name  year     size                                            capability
   TF-IDF  2000    0.001         Basic word importance/Document classification
 Word2Vec  2013    0.300                     Word relationships/Word analogies
    GloVe  2014    0.500    Global corpus statistics/Improved semantic capture
BERT-base  2018    0.400 Contextual representation/Bidirectional understanding
  RoBERTa  2019    0.500     Optimized pre-training/State-of-art on benchmarks
    GPT-2  2019    1.500             Better text generation/Zero-shot learning
 T5-large  2020    3.000            Text-to-text framework/Multi-task learning
    GPT-3  2020  175.000                Few-shot learning/Complex instructions
     PaLM  2022  540.000   Chain-of-thought reasoning/Advanced problem solving
    GPT-4  2023 1500.000            Nuanced reasoning/Multimodal understanding

Step 2: Install Dependencies

Required Python Packages

The following Python packages are needed for the project:

  • polars – Fast DataFrame library
  • sentence-transformers – Pre-trained models for sentence embeddings
  • tqdm – Progress bar library
  • pyarrow – Apache Arrow for fast data processing
  • altair – Declarative statistical visualization library
  • ipywidgets – Interactive widgets for Jupyter notebooks
  • pandas – Data analysis library
  • matplotlib – Visualization library

(Note that torch and scikit-learn are also imported in later steps; install them as well if they are not already present.)

We will call out specific libraries and their strengths and weaknesses as they come up.

## If you have Jupyter Lab running already, just uncomment this cell to install what you need.
# !pip install polars sentence-transformers tqdm pyarrow altair ipywidgets pandas matplotlib

Step 3: Data Collection and Processing

import polars as pl
df = pl.read_csv("https://raw.githubusercontent.com/bigmlcom/python/refs/heads/master/data/spam.csv", separator = "\t")
df.head()
shape: (5, 2)
Type Message
str str
"ham" "Go until jurong point, crazy..…
"ham" "Ok lar... Joking wif u oni..."
"spam" "Free entry in 2 a wkly comp to…
"ham" "U dun say so early hor... U c …
"ham" "Nah I don't think he goes to u…

Step 4: Feature Engineering

Let’s try MiniLM, a lightweight sentence embedding model.

We can create embeddings from the messages and insert them back into the dataset as a separate column. Note how Polars allows one to use typed (f32x384) vectors as a column type. Pandas cannot do this, which is one of the reasons Polars is recommended here. However, Polars is still very new and rapidly evolving, so make sure to stay up to date with its documentation.

from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
# Create the embeddings and save them efficiently as a numpy array
embeddings = sentence_model.encode(df["Message"].to_numpy())

# Bind the numpy array to the rest of the dataframe
df = df.with_columns([
    pl.Series(embeddings).alias("Message_Embeddings")
])
df.head()
shape: (5, 3)
Type Message Message_Embeddings
str str array[f32, 384]
"ham" "Go until jurong point, crazy..… [-0.016918, -0.038168, … -0.001258]
"ham" "Ok lar... Joking wif u oni..." [-0.013369, -0.04987, … -0.003396]
"spam" "Free entry in 2 a wkly comp to… [-0.015434, 0.063041, … 0.015645]
"ham" "U dun say so early hor... U c … [-0.012308, 0.037198, … -0.003828]
"ham" "Nah I don't think he goes to u… [0.0777, -0.132872, … 0.009034]

Step 5: Model Selection

There are a number of different approaches we could take, but for the purposes of illustration we'll use a simple linear model (logistic regression) on top of the sentence embeddings.

from torch import nn
# Define logistic regression model
class LogisticRegression(nn.Module):
    def __init__(self, input_dim):
        super(LogisticRegression, self).__init__()
        self.linear = nn.Linear(input_dim, 1)  # Single output node
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        return self.sigmoid(self.linear(x))
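
This model is tiny: one weight per embedding dimension plus a bias term. A quick sketch to confirm the parameter count for our 384-dimensional MiniLM embeddings:

# Instantiate the model for 384-dimensional embeddings and count its parameters
demo_model = LogisticRegression(384)
print(sum(p.numel() for p in demo_model.parameters()))  # 384 weights + 1 bias = 385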

Step 6: Training

We will define a training loop. Since we are borrowing a pre-trained sentence embedder (MiniLM), we don't need to train one ourselves; we only need to train the simple linear model defined above.

We need to call out some specific choices here:

  • criterion - the choice of "loss" function, in this case binary cross-entropy (BCE).
  • optimizer - the choice of step function, in this case (as in most cases) Adam.
  • epoch - the number of complete passes through the data that the training routine has completed.
# This cell contains a mix of AI and Human Generated Code.
import polars as pl
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Convert labels to binary (spam = 1, ham = 0) and keep as a Polars expression
df = df.with_columns((pl.col("Type") == "spam").cast(pl.Int32).alias("Label"))

# Convert Polars columns to NumPy arrays, then to PyTorch tensors

X = torch.tensor(df["Message_Embeddings"].to_numpy(), dtype=torch.float32)  # Embeddings tensor
y = torch.tensor(df["Label"].to_numpy(), dtype=torch.float32).unsqueeze(1)  # Labels tensor

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



# Initialize model
input_dim = X.shape[1]  # Get embedding size
model = LogisticRegression(input_dim)

# Define loss and optimizer
criterion = nn.BCELoss()  # Binary cross-entropy loss
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training loop
num_epochs = 100
holdout_metrics = []
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        # Evaluate the model on the holdout
        model.eval()
        with torch.no_grad():
            y_pred = model(X_test)
            holdout_loss = criterion(y_pred, y_test)
            y_pred_labels = (y_pred > 0.5).float()  # Convert probabilities to binary (0 or 1)

        # Convert tensors to NumPy arrays for sklearn metrics
        y_test_np = y_test.numpy()
        y_pred_np = y_pred_labels.numpy()

        # Compute evaluation metrics
        accuracy = accuracy_score(y_test_np, y_pred_np)
        metrics = {
            "epoch" : epoch,
            "loss" : loss.item(),
            "holdout_loss" : holdout_loss.item(),
            "holdout_accuracy" : accuracy,
        }
        holdout_metrics.append(metrics)

metrics = pd.DataFrame(holdout_metrics).set_index("epoch")
metrics
loss holdout_loss holdout_accuracy
epoch
0 0.666757 0.643077 0.878788
10 0.472433 0.468307 0.878788
20 0.363610 0.375695 0.878788
30 0.300856 0.325027 0.878788
40 0.258253 0.291411 0.886364
50 0.226611 0.266079 0.901515
60 0.202605 0.246158 0.924242
70 0.183906 0.230198 0.939394
80 0.168782 0.217127 0.939394
90 0.156178 0.206159 0.939394

Step 7: Evaluation

We come to the all-important evaluation step. We need to assess the results and determine whether the model is well optimized.

  1. We need to compare the loss metrics. What does loss mean, exactly? Do the loss metrics make sense? (Are they going down? Is one loss better than the other? Why?)
  2. The holdout accuracy starts near 88% and climbs to roughly 94% by epoch 90, which seems good. Is it good enough? What are some simple things we can do to make it better just by looking at the chart?
metrics.plot(title="Holdout Training Metrics by Epoch");
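
Accuracy alone can be misleading on an imbalanced dataset like this one: a model that labels everything "ham" already scores well. Since the precision/recall/F1 helpers were imported in Step 6, we can reuse the final holdout predictions for a fuller picture:

# Precision tracks false positives (ham flagged as spam); recall tracks
# false negatives (spam that gets through)
print("precision:", precision_score(y_test_np, y_pred_np))
print("recall:   ", recall_score(y_test_np, y_pred_np))
print("f1:       ", f1_score(y_test_np, y_pred_np))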

Future Steps

At this point, we must decide whether to release the model as-is or look for ways to improve its performance.

  1. What are some pieces of information that could be added to the training embedding/vector? How?
  2. What do we need to do to maintain this model? How would we detect model drift?
  3. If we need to change from a binary classification model to a multi-class model, what needs to change? What metrics should be used? (A minimal sketch of the model-side change follows this list.)
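
As a starting point for question 3, only the output layer and the loss need to change on the model side: one logit per class, and cross-entropy in place of BCE. A minimal sketch; the class count of 3 is hypothetical:

import torch.nn as nn

class MulticlassClassifier(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        # One logit per class instead of a single sigmoid output
        self.linear = nn.Linear(input_dim, num_classes)

    def forward(self, x):
        # Return raw logits; nn.CrossEntropyLoss applies softmax internally
        return self.linear(x)

model = MulticlassClassifier(384, num_classes=3)  # hypothetical class count
criterion = nn.CrossEntropyLoss()  # labels become integer class indices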

© Copyright 2024 Justin Donaldson. Except where otherwise noted, all rights reserved. The views and opinions on this website are my own and do not represent my current or former employers.